Abstract

A large portion of the power budget in server environ- 
ments goes into the I/O subsystem - the disk array in par- 
ticular. Traditional approaches to disk power management 
involve completely stopping the disk rotation, which can 
take a considerable amount of time, making them less use- 
ful in cases where idle times between disk requests may 
not be long enough to outweigh the overheads. This pa- 
per presents a new approach called DRPM to modulate 
disk speed (RPM) dynamically, and gives a practical imple- 
mentation to exploit this mechanism. Extensive simulations 
with different workload and hardware parameters show that 
DRPM can provide significant energy savings without com- 
promising much on performance. This paper also discusses 
practical issues when implementing DRPM on server disks.

Keywords: Server Disks, Power Management. 



1 Introduction 

Data-centric services - file and media servers, web and 
e-commerce applications, and transaction processing sys- 
tems to name a few - have become commonplace in the 
computing environments of large and small business en- 
terprises, as well as research and academic institutions. In 
addition, other data-centric services such as search engines 
and data repositories on the Internet are sustaining the needs 
of thousands of users each day. The commercial conse- 
quences of the performance and/or disruption of such ser- 
vices have made performance, reliability and availability 
the main targets for optimization traditionally. However,
power consumption is increasingly becoming a major concern
in these systems [4, 2]. Optimizing for power has
been understood to be important for extending battery life 
in embedded/mobile systems. It is only recently that the im- 
portance of power optimization in server environments has 
gained interest because of the cost of power delivery, cost 
of cooling the system components, and the impact of high 
operating temperatures on the stability and reliability of the 
components. 

Several recent studies have pointed out that data centers
can consume several megawatts of power [5]. It has
been observed [5] that power densities of data centers could
grow to over 100 Watts per square foot and that the ca- 
pacity of new data centers for 2005 could require nearly 
40 TWh (around $4B) per year. A considerable portion 
of this power budget on these servers is expected to be 
taken up by the disk subsystem, wherein a large number 
of disks are employed to handle the load and storage
capacity requirements. Typically, some kind of I/O parallelism
(RAID [31]) is employed to sustain the high
bandwidth/throughput needs of such applications, which are
inherently data-centric. While one could keep adding disks
for this purpose, at some point the consequent costs of in- 
creasing power consumption may overshadow the benefits 
in performance. Disks even when idle (spinning but not 
performing an operation), can drain a significant amount of 
power. For instance, a server class IBM Ultrastar 36ZX [17] 
disk is rated at 22.3 W (compare this to an Intel Xeon processor
clocked at 1.6 GHz, which is rated at 57.8 W). When
we go to specific server configurations (e.g. a 4-way Intel
Xeon SMP clocked at 1.6 GHz with 140 disks drawn from
[37]), the disks consume 13.5 times more power than the 
processors. 
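The 13.5x figure follows directly from the ratings quoted above. The short Python sketch below reproduces it, assuming every disk draws its rated 22.3 W and every processor its rated 57.8 W; the function and variable names are illustrative only.

```python
# Rated power figures quoted in the text (watts).
DISK_POWER_W = 22.3   # IBM Ultrastar 36ZX
CPU_POWER_W = 57.8    # Intel Xeon clocked at 1.6 GHz


def disk_to_cpu_power_ratio(num_disks: int, num_cpus: int) -> float:
    """Ratio of aggregate disk power to aggregate processor power."""
    return (num_disks * DISK_POWER_W) / (num_cpus * CPU_POWER_W)


# 4-way SMP with 140 disks, as in the configuration cited above.
ratio = disk_to_cpu_power_ratio(num_disks=140, num_cpus=4)
print(f"{ratio:.1f}")  # 13.5
```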

One possible solution is to use a large cache, under the 
assumption that the I/O workload will exhibit good locality. 
Caching can also potentially be used to delay writes as proposed
by Colarelli et al. [6] for archival and backup systems.
However, in most servers, though large caches are common, 
they are typically used for prefetching to hide disk latencies, 
since not all server workloads exhibit high temporal locality 
to effectively use the cache. Prefetching does not reduce the 
power consumption of the disks. 

Another way of alleviating this problem is by shutting
down disks, or at least stopping them from spinning, since the
spindle motor consumes most of a disk's power, as
will be described in section 3. Many disks offer different
power modes and one could choose to transition them to a 
low power mode when not in use (idle) which achieves this 
functionality (e.g. stop the spinning). Such techniques have 
been effectively used [7, 26, 14, 27, 9, 39, 30, 13] in lap- 
top environments where the goal is mainly to save battery 
energy. When in a low power mode, the disk needs to be 
spun up to full speed before a request can be serviced, and 
this latency is much more critical in servers than in a laptop 
setting. Further, the application of such traditional mode- 
control techniques for server environments is challenging, 
where one may not have enough idleness and where perfor- 
mance is more critical [11]. 

From the above discussion, we see two extremes in operation
- one that is performance efficient with disks spinning
all the time, and the other where the goal is power
optimization by stopping the spinning of the disk whenever
there is a chance, at the cost of performance. In this paper,
we present a new option - Dynamic Rotations Per Minute 
(DRPM) - where one could choose to dynamically operate 
between these extremes, and adaptively move to whichever 
criterion is more important at any time. The basic idea is 
to dynamically modulate the speed at which the disk spins 
(RPM), thereby controlling the power expended in the spin- 
dle motor driving the platters. Slowing down the speed of 
spinning the platters can potentially provide quadratic (with 
respect to the change in RPM) power savings. However, a 
lower RPM can hurt rotational latencies and transfer costs 
when servicing a request (at best linearly). In addition to 
these rotational latencies and transfer costs, disk accesses 
incur seek overheads to position the head to the appropriate 
track, and this is not impacted by the RPM. Consequently, it 
is possible to benefit more from power than one may lose
in performance from such RPM modulations. This DRPM 
mechanism provides the following benefits over the tradi- 
tional power mode control techniques [26, 27] (referred to 
as TPM in this paper): 

• Since TPM may need a lot more time to spin down the
disk, remain in the low power mode and then spin the 
disk back up, there may not be a sufficient duration of 
idleness to cover all this time without delaying subse- 
quent disk requests. On the other hand, DRPM does 
not need to fully spin down the disk, and can move 
down to a lower RPM and then back up again, if re- 
quired, in a shorter time (RPM change costs are more 
or less linear with the amplitude of the change). The 
system can service requests more readily when they 
arrive. 

• The disk does not necessarily have to be spun back up
to its full speed before servicing a request as is done 
in TPM. One could choose to spin it up if needed to a 
higher speed than what it is at currently (taking lower 
time than getting it from 0 RPM to full speed), or ser- 
vice the request at the current speed itself. While opt- 
ing to service the request at a speed less than the full 
speed may stretch the request service time, the exit la- 
tency from a lower power mode would be much lower 
than in TPM. 

• DRPM provides the flexibility of dynamically choosing
the operating point in power-performance trade-offs.
It allows the server to use state-of-the-art disks
(fastest in the market) and provides the ability to mod- 
ulate their power when needed without having to live 
with a static choice of slower disks. It also provides a 
larger continuum of operating points for servicing re- 
quests than the two extremes of full speed or 0 RPM. 
This allows the disk subsystem to adapt itself to the 
load imposed on it to save energy and still provide the 
performance that is expected of it. 

The DRPM approach is somewhat analogous to volt- 
age/frequency scaling [32] in integrated circuits which pro- 
vides more operating points for power-performance trade- 
offs than an on/off operation capability. A lower voltage 
(usually accompanied with a slower clock for letting cir- 
cuits stabilize) provides quadratic power savings and the 
slower clock stretches response time linearly, thus provid- 
ing energy savings during the overall execution. This is the 
first paper to propose and investigate a similar idea for disk
power management. 

The primary contribution of this paper is the DRPM
mechanism itself, where we identify already available technology
that allows disks to support multiple RPMs. More
importantly, we develop a performance and power model
for such disks based on this technology showing how costs
for dynamic RPM changes can be modeled.

The rest of this paper looks at evaluating this mechanism 
across different workload behaviors. We first look at how
well an optimal algorithm, called $DRPM_{perf}$ (that provides
the maximum energy savings without any degradation
in performance) performs under different workloads,
and compare its pros and cons with an optimal version
of TPM, called $TPM_{perf}$ (which provides the maximum
power savings for TPM without any degradation in performance).
When the load is extremely high (i.e. there are
very few idle periods), there is not much that can be done
in terms of power savings if one does not want to compromise
at all on performance, regardless of what technique
one may want to use. At the other end of the spectrum, when
there are very large idle periods, we find $TPM_{perf}$ providing
good power savings as is to be expected since it completely
stops spinning the disks as opposed to $DRPM_{perf}$,
which keeps them spinning albeit at a slow speed. However,
there is a wide range of intermediate operating conditions
when DRPM turns out to give much better (up to 60%)
savings in the idle mode energy consumption, even if one 
does not wish to compromise at all on performance. It is 
also possible to integrate the DRPM and TPM approaches, 
wherein one could use TPM when idle times are very long 
and DRPM otherwise. 

Finally, this paper presents a simple heuristic that dy- 
namically modulates disk speed using the DRPM mecha- 
nism and evaluates how well it performs with respect to 
$DRPM_{perf}$ where one has perfect knowledge of the future.
One could modulate this algorithm by setting tolerance levels
for degradation in response times, to amplify the power
savings. We find that this solution comes fairly close to the
power savings of $DRPM_{perf}$ (which does not incur any
response time degradation) without significant penalties in 
response time, and can sometimes even do better in terms 
of power savings. 

The rest of this paper is organized as follows. The next 
section gives an overview of the sources of energy con- 
sumption in a disk and prior techniques for power optimiza- 
tion. Section 3 presents the DRPM mechanism and the cost 
models for its implementation. Section 4 gives the experimental
setup and section 5 gives results with $DRPM_{perf}$,
comparing its potential with TPM and conducting a sensitivity
analysis. The details of our heuristic for online speed
setting and its evaluation are given in section 6. Section 7 
discusses some issues that arise when implementing a real 
DRPM disk. Finally, section 8 summarizes the contribu- 
tions of this paper. 



2 Disk Power and TPM 



There are several components at the disk that contribute 
to its overall power consumption. These include the spindle 
motor which is responsible for spinning the platters, the ac- 
tuator which is responsible for the head movements (seeks), 
the electrical components that are involved in the transfer 
operations, the disk cache, and other electronic circuitry. Of 
these, the first two are mechanical components and typically 
overshadow the others, and of these the spindle motor is the 
most dominant. Studies of power measurements on differ- 
ent disks have shown that the spindle motor accounts for 
nearly 50% of the overall idle power for a two-platter disk, 
and this can be as high as 81.34% for a ten-platter server 
class disk [12]. Consequently, traditional power manage- 
ment techniques at the higher level focus on addressing this 
issue by shutting down this motor when not in active use. 



Disk power management has been extensively studied in 
the context of single disk systems, particularly for the lap- 
top/desktop environment. Many current disks offer differ- 
ent power modes of operation, such as active - when the 
disk is servicing a request, idle - when it is spinning and 
not serving a request, and one or more low power modes 
that consume less energy than idle (where the disk platters 
do not spin). Managing the energy consumption of the disk 
consists of two steps, namely, detecting suitable idle periods 
and then spinning down the disk to a low power mode whenever
it is predicted that the action would save energy. Detection
of idle periods usually involves tracking some kind
of history to make predictions on how long the next idle pe- 
riod would last. If this period is long enough (to outweigh 
spindown/spinup costs), the disk is explicitly spun down to 
the low power mode. When an I/O request comes to a disk 
in the spundown state, the disk first needs to be spun up to 
service this request (incurring additional exit latencies and 
power costs in the process). One could pro-actively spin 
up the disk ahead of the next request if predictions can be 
made accurately, but many prior studies have not done this.
Many idle time predictors use a time-threshold to find out 
the duration of the next idle period. A fixed threshold is 
used in [26], wherein if the idle period lasts over 2 seconds, 
the disk is spun down, and spun back up only when the next 
request arrives. The threshold could itself be varied adap- 
tively over the execution of the program [7, 14]. A detailed 
study of idle-time predictors and their effectiveness in disk 
power management has been conducted in [9]. Lu et al. [27]
provide an experimental comparison of several disk power 
management schemes proposed in literature on a single disk 
platform. 
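The fixed-threshold policy described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from the cited work: request service times are ignored, and the function name and the sample arrival trace are hypothetical.

```python
def fixed_threshold_spindowns(arrivals_s, threshold_s=2.0):
    """Return (spin-down time, spin-up time) pairs for a fixed-threshold policy.

    arrivals_s: sorted list of request arrival times in seconds.
    Service time is ignored for simplicity (an assumption of this sketch).
    """
    events = []
    for prev, nxt in zip(arrivals_s, arrivals_s[1:]):
        if nxt - prev > threshold_s:
            # The disk has sat idle for threshold_s, so it is spun down;
            # it is spun back up only when the next request arrives.
            events.append((prev + threshold_s, nxt))
    return events


# Only the gaps 0.5 -> 4.0 and 4.1 -> 10.0 exceed the 2 second threshold.
print(fixed_threshold_spindowns([0.0, 0.5, 4.0, 4.1, 10.0]))
```

Note that the request arriving at a spun-down disk still pays the full spin-up latency, which is the cost that makes this policy unattractive when idle periods are short.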

We broadly refer to these previous power mode-control 
mechanisms as TPM in this paper. It is to be noted that 
TPM has the disk spinning at either its full speed or fully 
stationary and does not allow intermediate RPMs. 

Another power saving approach, though orthogonal to 
this work, is to replace a single disk with multiple smaller 
form-factor disks that consume lower power as in [41]. 

3 Dynamic RPM (DRPM) 

The TPM techniques (and our DRPM mechanism) can 
be used in conjunction with other techniques that can re- 
duce disk accesses (by aggregation and/or caching using 
hardware/OS/application support) or place data to reduce 
head movements [15] (which can save actuator power) to
further the power savings. However, as was mentioned ear- 
lier, the spindle motor power needed to spin the disks is still 
the major power consumer [12], which is expended even 
when the disk is not serving a request (and is spinning). As 
can be seen in Figure 1, which shows the RPMs and power
consumption of different IBM server class disks over the 
years, there appears to be a strong correlation between the 
rotational speed and the idle power in these disks (though 
it should be noted that RPM is not the only variation across 
the technologies employed). This motivates us to investi- 
gate the possibility of modulating the RPM dynamically to 
adjust the power consumption. 

3.1 Basics of Disk Spindle Motors 

A detailed exposition of disk spindle motors (SPMs) can 
be found in [20, 35]. Disk SPMs are permanent magnet DC
brushless motors. In order to operate as a brushless motor,
sensors are required inside the motor to provide the pulses 
necessary for commutation (i.e., rotation). These sensors 
may either be Hall-Effect sensors or back-EMF sensors. 




Figure 1. IBM Server Disks - Idle Power Con- 
sumption. For each disk, the form-factor 
was fixed at 3.5" and the largest capacity 
configuration was chosen. The idle power 
is relatively independent of the form-factor [12, 35].
We found that the idle power was not
that strongly related to the capacity. For in- 
stance, two other IBM disks, the Ultrastar 9ZX 
and the Ultrastar 18ZX, are both 10,000 RPM 
disks with 9.1 and 18.2 GB capacity respec- 
tively, while their idle power consumption is 
16.5 W and 16.3 W respectively. 



Speed control of the motors can be achieved by using Pulse- 
Width Modulation (PWM) techniques, which make use of 
the data from the sensors. 

A large accelerating torque is first needed to make the 
disks start spinning. This high torque is essentially required 
to overcome the stiction forces caused by the heads sticking 
to the surface of the disk platter. The use of technologies 
like Load/Unload [18] can ameliorate this problem by lift- 
ing the disk-arm from the surface of the platter. These tech- 
nologies also provide power benefits and are used for exam- 
ple in IBM hard disk drives [18] to implement the special 
IDLE-3 mode. In addition to providing the starting torque, 
the SPM also needs to sustain its RPM once it reaches the 
intended speed. 

One traditional approach in improving disk performance 
over the years has been to increase the RPM (which reduces
rotational latencies and transfer times), which can
prove beneficial in bandwidth-bound applications. How- 
ever, such increases can cause concern in additional is- 
sues such as noise and Non-Repeatable Run-Outs (NRROs). 
(NRROs are off-track errors that can occur at higher RPMs, 
especially at high track densities.) These design considera- 
tions in the development of high RPM disks have been ad- 
dressed by the use of advanced motor-bearing technologies 
like fluid [1, 16] and air-bearings [38].

However, the power associated with high RPMs still re- 
mains and this paper focuses on this specific aspect. 

3.2 Analytical Formulations for Motor Dynamics 

Our DRPM solution dynamically controls the spindle 
motor to change the RPM of the spinning platters. The 
RPM-selection capability can be provided by allowing the
spindle-motor control block [36] of the hard-disk controller 
[21] to be programmable. For example, the desired RPM 
can be input via a programmable register in the hard-disk 
controller. The value read from this register can in turn be
used by the spindle-motor driver [3] to generate the requi- 
site signals for operating the disk at that RPM. 



We now present the time overhead needed to effect an 
RPM change, and the power of the resulting state as a func- 
tion of the RPM. 



3.2.1 Calculating RPM Transition Times

In order to calculate the time required for a speed-change, 
we need some physical data of the spindle-motor. This in- 
formation for a specific disk is usually proprietary, but there 
are DC brushless motors commercially available that we 
can use for this purpose. We have obtained the necessary in- 
formation from the datasheet of a Maxon EC-20 20 mm flat 
brushless permanent magnet DC motor [28], whose physi- 
cal characteristics closely match those of a hard disk spindle 
motor. Table 1 summarizes the basic mechanical character- 
istics of this motor. 



Parameter                                   Value    Units
Max. Permissible Speed                      15000    rpm
Rotor Inertia ($J_0$)                       3.84     g cm^2
Torque Constant ($K_T$)                     9.1      mNm/A
Max. Continuous Current at 12K rpm ($I$)    0.708    A

Table 1. Maxon EC-20 Motor Characteristics

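The motor specifications give a formula for calculating
the time $\Delta t$ (in ms) required for a speed-change of $\Delta n$ RPM
with a load inertia $J_L$ as:

$$\Delta t = \frac{\pi \, \Delta n \, (J_0 + J_L)}{300 \, K_T \, I}$$

where $J_0$, $K_T$ and $I$ are the rotor inertia, torque constant
and current listed in Table 1.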

The load on the spindle motor is the platter assembly.
We dismantled a 3.5" Quantum hard disk, and measured
the weight of an individual platter using a sensitive balance
and also its radius. Its weight $m$ was found to be 14.65
gm and radius $r$ was 4.7498 cm. Using these values, and
assuming 10 platters per disk (as in [17], though we also
have sensitivity results for different number of platters), we
calculated the moment of inertia of the load $J_L$ (in g cm$^2$)
as:

$$J_L = n_p \cdot \frac{1}{2} m r^2 = 10 \times \frac{1}{2} \times 14.65 \times (4.7498)^2$$

where $n_p$ is the number of platters.

$$\Rightarrow J_L = 1652.563 \text{ g cm}^2$$

Therefore, we have

$$\Delta t = 2.693 \times 10^{-4} \, \Delta n \qquad (1)$$

This shows that the time cost of changing the RPM of 
the disk is directly proportional (linear) to the amplitude of 
the RPM change. 
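As a quick check of the arithmetic, the sketch below recomputes the load inertia from the measured platter mass and radius and wraps Equation 1 in a function. The function names are illustrative, each platter is modelled as a solid disc as in the text, and the coefficient of Equation 1 is taken as published rather than rederived.

```python
def platter_stack_inertia(num_platters: int, mass_g: float, radius_cm: float) -> float:
    """Moment of inertia (g cm^2) of identical platters, each a solid disc."""
    return num_platters * 0.5 * mass_g * radius_cm ** 2


def rpm_transition_time(delta_rpm: float, coeff: float = 2.693e-4) -> float:
    """Equation 1: time for an RPM change, linear in its amplitude."""
    return coeff * abs(delta_rpm)


j_load = platter_stack_inertia(num_platters=10, mass_g=14.65, radius_cm=4.7498)
print(round(j_load, 2))  # 1652.56

# Linearity: doubling the amplitude of the change doubles the time.
print(rpm_transition_time(600), rpm_transition_time(1200))
```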

3.2.2 Calculating the Power Consumption at an RPM 
Level 

We briefly explain the dependence between the power consumption
of a motor and its rotation speed. A detailed exposition
of this topic can be found in [24]. The motor voltage
$V$ (also called the back electromotive force or back-emf) is
related to the angular velocity (rotation-speed) $\omega$ as

$$V = K_E \, \omega$$




Figure 2. Current Drawn by Sony Multimode 
Hard Disk 



where $K_E$ is called the back-emf constant. The power consumed
by the motor, $P$, is

$$P = V I = \frac{V^2}{R}$$

where $R$ is the resistance of the motor. Therefore, we have

$$P = \frac{K_E^2 \, \omega^2}{R} \qquad (2)$$



This equation, similar to that relating the power and voltage 
for CMOS circuits, indicates that a change in the rotation- 
speed of the disk has a quadratic effect on its power con- 
sumption. In order to investigate whether this relationship 
holds true in the context of a hard disk, we used an experimental
curve-fitting approach. There exists a commercial
hard disk today - the Multimode Hard Disk Drive [29] from
Sony - that indeed supports a variable speed spindle motor.
The speed setting on such a disk is accomplished in a more
static (pre-configured) fashion, rather than modulating this
during the course of execution. The published current consumption
values of this disk for different RPM values provide
insight on how the current drawn varies with the RPM.

Figure 2 shows the current drawn by the SPM of the Mul- 
timode hard disk (repeated from [29]). A simple curve fit of 
these data points clearly shows this quadratic relationship. 
This relationship may appear somewhat different from the 
trends shown in Figure 1 where one could argue that the 
relationship between RPM and power is linear. However, 
Figure 1 shows the trend for disks of different generations,
where it is not only the RPM that changes but the hardware
itself. On the other hand, equation 2 and the Multimode
hard disk current consumption profile (Figure 2) show that the
relation between these two parameters is more quadratic for
an individual disk drive. 

This Multimode disk is composed of only two platters, 
while we are looking at server class disks that have several 
more platters (8-10 platters). Consequently, we cannot di- 
rectly apply this model to our environment. On the other 
hand, a study from IBM [35] projects the relation between 
idle power and RPM for 3.5" server class IBM disks. Note
that in these disks, other design factors such as the change
in the number of disk-platters, have been considered besides
just the RPM to make the projections, and we depict the results
from there by the points shown in Figure 3. In our
power modeling strategy for a variable RPM disk, we employed
two approaches, to capture a quadratic and linear
relationship respectively:

Figure 3. Comparison of DRPM Model to the
IBM Projections



• We took the points from the IBM study [35] and used
a quadratic curve to approximate their behavior as is
shown by the solid curve in Figure 3 to model the idle
power as

$$P_{idle} = 1.318 \times 10^{-7} \, rpm^2 - 4.439 \times 10^{-4} \, rpm + 8.643 \qquad (3)$$

• We also performed a linear least squares fit through
the points as is shown by the dotted line in Figure 3 to
model the idle power as

$$P_{idle} = 0.0013 \, rpm + 4.158 \qquad (4)$$

We have used both models in our experiments to study 
the potential of DRPM. In general, we find that the results 
are not very different in the ranges of RPMs that were studied
(between 3600 and 12000) as is evident even from Figure
3 which shows the differences between linear and quadratic 
models within this range are not very significant. 
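To see how the two fits compare across the studied range, the following Python sketch evaluates Equations 3 and 4 at a few RPM levels. The coefficients are those given above; the function names and the choice of sample points are illustrative.

```python
def idle_power_quadratic(rpm: float) -> float:
    """Equation 3: quadratic fit to the IBM projections (watts)."""
    return 1.318e-7 * rpm ** 2 - 4.439e-4 * rpm + 8.643


def idle_power_linear(rpm: float) -> float:
    """Equation 4: linear least-squares fit to the same points (watts)."""
    return 0.0013 * rpm + 4.158


# Tabulate both models from the lowest to the highest RPM considered.
for rpm in range(3600, 12001, 2100):
    quad = idle_power_quadratic(rpm)
    lin = idle_power_linear(rpm)
    print(f"{rpm:6d} RPM  quadratic={quad:6.2f} W  linear={lin:6.2f} W")
```

At 12000 RPM the quadratic model evaluates to roughly 22.3 W, which agrees with the rated idle power of the Ultrastar 36ZX quoted earlier.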

Equations 1 and 3/4 provide the necessary information 
in modeling the RPM transition costs and power dynamics 
of our DRPM strategy. For the power costs of transitioning 
from one RPM to another, we conservatively assume that 
the power during this time is the same as that of the higher 
RPM state. 
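The conservative transition-cost assumption can be written down as a short Python sketch. It takes Equation 1 at face value, with the duration in ms as stated in section 3.2.1, and charges the idle power of the higher RPM level (Equation 3) for the whole transition. The function names and the unit conversion are choices of this sketch, not of the simulator.

```python
def idle_power_quadratic(rpm: float) -> float:
    """Equation 3: idle power (watts) at a given RPM."""
    return 1.318e-7 * rpm ** 2 - 4.439e-4 * rpm + 8.643


def transition_energy_j(rpm_from: float, rpm_to: float,
                        ms_per_rpm: float = 2.693e-4) -> float:
    """Energy (joules) charged to one RPM change.

    Duration follows Equation 1 (interpreted in ms), and the power drawn
    throughout is that of the higher of the two RPM levels.
    """
    duration_s = ms_per_rpm * abs(rpm_to - rpm_from) / 1000.0
    return idle_power_quadratic(max(rpm_from, rpm_to)) * duration_s


# Under this assumption, slowing down and speeding up cost the same.
print(transition_energy_j(12000, 3600) == transition_energy_j(3600, 12000))  # True
```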



is very aggressive as the actual power consumption even in 
this mode is typically much higher (for example its value is
12.72 W in the actual Ultrastar disk). However, as we shall 
later show, even with such a deep low-power standby mode 
that is used by TPM, DRPM surpasses TPM in terms of 
energy-benefits in several cases. Also, in our power models, 
we have accounted for the case that the power penalties for 
the active and seek modes (in addition to idle power) also 
depend upon the RPM of the disk. The modes exploited by 
TPM, including the power consumption of each mode and 
the transition costs are illustrated in Figure 4. 

We have considered several RPM operating levels, i.e. 
different resolutions for stepping up/down the speed of the 
spindle motor. These "step-sizes" are as low as 600 RPM, 
providing 13 steps between the extremes of 3600 and 12000 
RPM (15 RPM levels in all). The default configuration that 
we use in our experiments is a 12-disk RAID-5 array, with a
quadratic DRPM power-model and a step-size of 600 RPM. 
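The set of operating levels implied by a given step-size is easy to enumerate. The sketch below does so for the two step-sizes considered in this study (600 and 2100 RPM); the helper name is illustrative.

```python
def rpm_levels(min_rpm: int = 3600, max_rpm: int = 12000, step: int = 600):
    """All RPM operating levels from min_rpm to max_rpm inclusive."""
    return list(range(min_rpm, max_rpm + 1, step))


levels = rpm_levels()
print(len(levels))            # 15 levels in all
print(len(levels) - 2)        # 13 levels strictly between the two extremes
print(rpm_levels(step=2100))  # [3600, 5700, 7800, 9900, 12000]
```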



Parameter                          Value

Parameters Common to TPM and DRPM
Number of Disks in the Array       12*, 24
Stripe Size                        16 KB
RAID Level                         5*, 10
Individual Disk Capacity           33.6 GB
Disk Cache Size                    4 MB
Max. Disk Rotation Speed           12000 RPM
Idle Power @ 12000 RPM             22.3 W
Active (R/W) Power @ 12000 RPM     39 W
Seek Power @ 12000 RPM             39 W
Standby Power                      4.15 W
Spinup Power                       34.8 W
Spinup Time                        26 secs.
Spindown Time                      15 secs.
Disk-Arm Scheduling                Elevator
Bus Type                           Ultra-3 SCSI

DRPM-Specific Parameters
Power Model Type                   Quadratic*, Linear
Minimum Disk Rotation Speed        3600 RPM
RPM Step-Size                      600*, 2100 RPM

Table 2. Simulation Parameters with the default
configurations marked with an asterisk (*). Disk spinups
and spindowns occur from 0 to 12000 RPM
and vice-versa respectively.



4 Experimental Setup and Workload Description

We conducted our evaluations using the DiskSim [8]
simulator modeling a disk array for a server environment.
DiskSim provides a large number of timing and
configuration parameters for specifying disks and the
controllers/buses for the I/O interface. The simulator was
augmented with power models to record the energy consumption
of the disks when performing operations like data-transfers,
seeks, or when just idling. Our DRPM implementation
accounts for the queuing and service delays caused
by the changes in the RPM of the disks in the array. The
default configuration parameters used in the simulations are
given in Table 2, many of which have been taken from the
data sheet of the IBM Ultrastar 36ZX [17] server hard disk.
The power consumption of the standby mode was calculated
by setting the spindle motor power consumption to 0
when calculating $P_{idle}$ based on the method described in
section 3. Note that this value for the power consumption 




Figure 4. TPM Power Modes 



Since we want to demonstrate the potential of the DRPM 
mechanism across a spectrum of operating conditions (dif- 
ferent loads, long idle periods, bursts of I/O requests, etc.) 
that server disks may experience, and to evaluate the pros
and cons of DRPM over other power saving approaches,
we chose to conduct this study with several synthetic work- 
loads where we could modulate such behavior. The syn- 
thetic workload generator injects a million I/O requests with 
different inter-arrival times, and request parameters (start- 
ing sector, request-size, and the type of access - read/write). 
All the workloads consist of 60% read requests and 20% of
all requests are sequential in nature. These characteristics
were chosen based on [34]. Since a closed-system simulation
may alter the injected load based on service times of
the disk array for previous requests, we conducted an open- 
system simulation with these workloads. 

We considered two types of distributions for the inter-arrival
times, namely, exponential and Pareto. As is well understood,
exponential arrivals model a purely random
Poisson process, and to a large extent model a regular
traffic arrival behavior (without burstiness). On the other
hand, the Pareto distribution introduces burstiness in arrivals,
which can be controlled. The Pareto distribution
is characterized by two parameters, namely, $\alpha$, called the
shape-parameter, and $\beta$, called the lower cutoff value (the
smallest value a Pareto random-variable can take). We
chose a Pareto distribution with a finite mean and infinite 
variance. 

For both distributions, we varied the mean inter-arrival 
time (in ms) as a parameter. In Pareto, there are different 
ways by which the traffic can be generated for a given mean. 
We set $\beta$ to 1 ms and varied $\alpha$ (i.e. when the mean is
increased, the time between the bursts - idleness - tends to
increase).

We use the term workload to define the combination 
of the distribution that is being used and the mean inter- 
arrival time for this distribution. For instance, the workload 
<Par,10> denotes a Pareto traffic with a mean inter-arrival 
time of 10 ms. 
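A minimal Python sketch of how such inter-arrival times could be generated is given below. It is not the generator used in this study: the function names, the seeding, and the use of inverse-transform sampling are assumptions of the sketch. The shape parameter is obtained from the standard Pareto mean, mean = alpha * beta / (alpha - 1), with the lower cutoff beta fixed at 1 ms.

```python
import random


def pareto_shape_for_mean(mean_ms: float, beta_ms: float = 1.0) -> float:
    """Solve mean = alpha * beta / (alpha - 1) for the shape alpha."""
    if mean_ms <= beta_ms:
        raise ValueError("the mean must exceed the lower cutoff")
    return mean_ms / (mean_ms - beta_ms)


def pareto_interarrivals(n: int, mean_ms: float, beta_ms: float = 1.0, seed: int = 0):
    """Draw n Pareto inter-arrival times (ms) by inverse-transform sampling."""
    alpha = pareto_shape_for_mean(mean_ms, beta_ms)
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], so every sample is finite and >= beta_ms.
    return [beta_ms / (1.0 - rng.random()) ** (1.0 / alpha) for _ in range(n)]


def exponential_interarrivals(n: int, mean_ms: float, seed: int = 0):
    """Draw n exponential inter-arrival times (ms) with the given mean."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_ms) for _ in range(n)]


# <Par,10>: a mean of 10 ms with a 1 ms cutoff gives alpha = 10/9, which
# lies between 1 and 2 (finite mean, infinite variance).
print(pareto_shape_for_mean(10.0))
gaps = pareto_interarrivals(5, mean_ms=10.0)
```

Because the variance is infinite, the sample mean of a finite Pareto trace converges slowly and can differ noticeably from the nominal mean.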

In general, statistics to differentiate between the schemes 
are collected after the initial start up effects. We compare 
the schemes for each workload using three metrics, namely, 
total energy consumption over all the requests ($E_{tot}$), idle-mode
energy consumption over all the requests ($E_{idle}$), and
response-time per I/O request ($T$). These can be defined as
follows: 



• The total energy consumption ($E_{tot}$) is the energy consumed
by all the disks in the array from the beginning
to the end of the simulation period. We monitor all the 
disk activity (states) and their duration in each state, 
and use this to calculate the overall energy consump- 
tion by the disks (integral of the power in each state 
over the duration in that state). 

• The idle-mode energy consumption ($E_{idle}$) is the energy
consumed by all the disks in the array while not
servicing an I/O request (i.e., while not performing 
seeks or data-transfers). This value is directly im- 
pacted by the spinning-speed of the spindle motor. 

• The response-time (T) is the time between the request 
submission and the request completion averaged over 
all the requests. This directly has a bearing on the de- 
livered system throughput. 
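The three metrics can be computed from a per-state activity log as sketched below. The log format (state name, power in watts, duration in seconds) and the function name are hypothetical; the power values in the example are the 12000 RPM figures from Table 2.

```python
def energy_metrics(state_log, response_times_ms):
    """Compute (E_tot, E_idle, T) from a list of (state, power_w, duration_s).

    E_tot sums power * duration over every logged interval of every disk,
    E_idle restricts that sum to intervals in the "idle" state, and T is
    the mean response time over all requests.
    """
    e_tot = sum(power * dur for _, power, dur in state_log)
    e_idle = sum(power * dur for state, power, dur in state_log if state == "idle")
    t_avg = sum(response_times_ms) / len(response_times_ms)
    return e_tot, e_idle, t_avg


# One disk: 10 s idle, 0.5 s seeking, 1.5 s transferring data.
log = [("idle", 22.3, 10.0), ("seek", 39.0, 0.5), ("active", 39.0, 1.5)]
e_tot, e_idle, t_avg = energy_metrics(log, [4.0, 6.0])
print(e_tot, e_idle, t_avg)
```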

Finally, we use the terms power and energy inter- 
changeably sometimes. 



Pareto probability distribution function is given by
$P(x) = \frac{\alpha \beta^{\alpha}}{x^{\alpha + 1}}$, $x \geq \beta$, $\alpha > 0$. The mean is given by $E(x) = \frac{\alpha \beta}{\alpha - 1}$.

Figure 5. Breakdown of $E_{tot}$ for the different
workloads. On the x-axis, each pair represents
a workload defined by a <Probability
Distribution, Mean Inter-Arrival Time> pair.



5 Power Optimization without Performance 
Degradation 

5.1 Energy Breakdown of the Workloads 

Before we examine detailed results, it is important to un- 
derstand where power is being drained over the course of 
execution, i.e. when the disk is transferring data (Active), 
or positioning the head (Positioning) or when it is idling 
(Idle). Figure 5 gives the breakdown of energy consump- 
tion of two workloads from each of the inter-arrival time 
distributions - one at high and another at low load condi- 
tions - into these three components when there is no power 
saving technique employed. The high and low loads also
indicate that idle periods are short and long respectively.

As is to be expected, when the load is light
(<Exp,1000>, <Par,50>), the idle energy is the most dominant
component. However, we find that even when we
move to high load conditions (<Exp,10>, <Par,5>), the 
idle energy is still the most significant of the three. While 
the positioning energy does become important at these high 
loads, the results suggest that most of the benefits to gain are 
from optimizing the idle power (in particular, the spindle 
motor component which consumes 81.34% of this power).
Consequently, our focus in the rest of this section is on the 
idle power component, by looking at how different schemes 
(TPM and DRPM) exploit the idleness for energy savings. 

5.2 The Potential Benefits of DRPM (DRPM_perf)

The power saving, either with TPM or DRPM, is based
on the idleness of disks between serving requests. With
DRPM, it is possible to get more savings by also serving
requests at a lower RPM, but this may result in performance
degradation. In the first set of results, we do not
consider this to be an option, i.e. we define a scheme called
DRPM_perf whose performance is no different from that of
the original disk subsystem (which does not employ any
power management technique). Further, to investigate the
potential of DRPM, we assume the existence
of an idle-time prediction oracle, which can exactly predict
when the next request will arrive after serving each request.
DRPM_perf uses this prediction to find out
how low an RPM it can go down to and still come back up
to full speed before servicing the next request (accounting
for the times and energy required for such transitions).
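
As a rough illustration of the oracle-driven choice, the sketch below picks the lowest RPM level whose round trip (ramp down, then ramp back up) fits inside a known idle period. The RPM levels and the linear transition-time model (`secs_per_rpm`, with the same cost in both directions) are assumptions made for this example and stand in for the timing model of the paper. Only the timing constraint is checked here; a fuller version would also compare transition energy against the idle energy saved.

```python
def lowest_reachable_rpm(idle_s, levels, secs_per_rpm):
    """Return the lowest RPM the disk can drop to and still be back at
    full speed by the time the idle period ends.

    idle_s       -- idle time until the next request (known from the oracle)
    levels       -- available RPM levels, in any order
    secs_per_rpm -- assumed transition cost in seconds per RPM of change
                    (hypothetical linear model, same cost up and down)
    """
    full = max(levels)
    for rpm in sorted(levels):  # try the lowest level first
        round_trip = 2 * (full - rpm) * secs_per_rpm
        if round_trip <= idle_s:
            return rpm
    return full  # not reached: full speed always has a zero round trip

# Hypothetical levels; a 2000 RPM change is assumed to take 1 second.
LEVELS = (6000, 8000, 10000, 12000)
SECS_PER_RPM = 0.0005

short_gap = lowest_reachable_rpm(0.5, LEVELS, SECS_PER_RPM)
medium_gap = lowest_reachable_rpm(5.0, LEVELS, SECS_PER_RPM)
long_gap = lowest_reachable_rpm(100.0, LEVELS, SECS_PER_RPM)
```

With these placeholder numbers, a very short gap leaves the disk at full speed, a medium gap allows a partial ramp-down, and a long gap reaches the lowest level.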




Figure 6. Savings in Idle Energy using
TPM_perf, DRPM_perf, and Combined are presented
for the quadratic power model.



To be fair, the same oracle can be used by TPM as well
for effecting power mode transitions, and we call such a
scheme TPM_perf, where the disk is transitioned to the
standby mode if the time to the next request is long enough
to accommodate the spindown followed by a spinup.

Note that DRPM_perf can exploit much smaller idle
times for power savings compared to TPM_perf. On the
other hand, when the idle time is really long, TPM_perf
can save more energy by stopping the spinning completely
(while DRPM can take it down to only 5600 RPM). Therefore,
in order to investigate the potential benefits if both
these techniques were used in conjunction, we have also
considered a scheme called Combined. In the Combined
scheme, we use the oracle to determine which of the two
techniques saves the maximum energy, for each idle time-period.
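
The per-period decision of the Combined scheme can be sketched as follows. All power levels, transition times, and transition energies below are placeholder values chosen for the example, and the two cost functions are deliberately crude stand-ins for the DRPM_perf and TPM_perf models; only the selection logic (evaluate each technique for the known idle period and keep the cheapest feasible one) reflects the description above.

```python
# Placeholder power levels in watts (hypothetical, for illustration only).
P_FULL, P_LOW, P_STANDBY = 10.0, 4.0, 1.0

def full_speed(idle_s):
    """Energy if the disk simply idles at full RPM."""
    return P_FULL * idle_s

def drpm_perf(idle_s, t_trans=2.0, e_trans=30.0):
    """Energy if the disk ramps to a low RPM and back; None if it cannot fit."""
    if idle_s < t_trans:
        return None
    return e_trans + P_LOW * (idle_s - t_trans)

def tpm_perf(idle_s, t_trans=20.0, e_trans=300.0):
    """Energy if the disk spins down to standby and back up; None if it cannot fit."""
    if idle_s < t_trans:
        return None
    return e_trans + P_STANDBY * (idle_s - t_trans)

OPTIONS = {"full_speed": full_speed, "DRPM_perf": drpm_perf, "TPM_perf": tpm_perf}

def combined_choice(idle_s, options=OPTIONS):
    """Return (scheme name, energy) of the cheapest feasible option."""
    costs = {name: fn(idle_s) for name, fn in options.items()}
    feasible = {name: e for name, e in costs.items() if e is not None}
    return min(feasible.items(), key=lambda kv: kv[1])

choices = [combined_choice(t) for t in (1.0, 10.0, 200.0)]
```

Under these assumed costs the choice moves from staying at full speed, to DRPM_perf, to TPM_perf as the idle period grows, which mirrors the qualitative argument in the text.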

We would like to point out that DRPM_perf, TPM_perf,
and Combined do not put a bound on the energy savings
that one can ever get. Rather, they give a bound when
performance cannot be compromised. Figure 6 presents the
idle energy savings (idle energy was shown to be the major
contributor to overall energy) for these schemes as a function
of the inter-arrival times in the two distributions.

When we first examine the exponential traffic results, we
note that the results confirm our earlier discussion wherein
large inter-arrival times favor TPM_perf. At the other end of
the spectrum, when inter-arrival times get very small, there
is not really much scope for any of these schemes to save
energy if performance compromise is not an option. However,
between these two extremes, we find that DRPM_perf
provides much higher savings than TPM_perf. It finds more
idle time opportunities to transition to a lower RPM mode,
which may not be long enough for TPM. As is to be expected,
the Combined scheme approaches the better of the
two across the spectrum of workloads.

When we next look at the Pareto traffic results, we find
that the arrivals are fast enough (due to the burstiness of this
distribution), even at the higher mean values considered,
that DRPM_perf consistently outperforms TPM_perf in the
range under consideration. This burstiness is also the reason
the energy savings of all the schemes with this traffic
distribution are lower than those for exponential traffic,
where the idle times vary less.

The purpose of this exercise was to examine the potential
of DRPM with respect to TPM while not compromising on
performance. The rest of this section looks at the
sensitivity of the power savings with this approach
to different hardware and workload parameters. Since the
sensitivity of DRPM is more prominent at the intermediate
load conditions (where it was shown to give better savings
than TPM), we focus more on those regions in the rest of
this paper.

5.3 Sensitivity Analysis of DRPM_perf



5.3.1 Number of Platters 

Disks show significant variability in platter counts. At one 
end, the laptop disks have 1 or 2 platters, while server class
disks can have as many as 8-10 platters. The number of
platters has a consequence on the weight imposed on the 
spindle motor, which has to spin them as was described ear- 
lier. In Figure 7 the effect of three different platter counts 
(4, 10 and 16) has been shown for the two types of traf- 
fic with different load conditions. It can be seen that as 
the number of platters increases, the savings drop. This is 
because a larger weight is imposed on the spindle motor, re- 
quiring a higher torque for RPM changes thereby incurring 
more overheads. Nevertheless, even at the 16-platter count, 
which is significantly higher than those in use today, we still 
find appreciable power savings even at high load conditions. 

While one may think that the need for storage-capacity 
increase over time may necessitate more platters, it is to be 
noted that this increase is usually achieved by denser media 
rather than by adding more platters. For instance, the IBM 
DFHS-Sxx (1994), 36ZX (1999), and 146Z10 (2002) have
storage capacities of 4.51 GB, 36.7 GB and 146 GB respectively,
but their platter counts are 8, 10, and 6. Therefore we
do not expect platter counts to increase significantly. 




Figure 7. Sensitivity to Number of Platters in 
the Disk Assembly 



5.3.2 Quadratic vs. Linear Power Model 

As was discussed in section 3, we considered both quadratic
and linear scaling models for the idle power consumption
of the spindle motor at different RPMs. While the earlier
results were presented with the quadratic model, we compare
those results with the savings for DRPM_perf with the
linear model in Figure 8. We can observe that the differences
between these two models are not very significant,
though the linear model slightly under-performs the
quadratic, as is to be expected. This again confirms our earlier
observation that a linear and a quadratic model do not
differ much across these ranges of RPM values.
Consequently, we find that DRPM_perf, even with a
conservative linear power scaling model, gives
better energy savings than TPM_perf (compare with Figure 6).
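
To see why the linear model is the more conservative of the two, consider the simplest possible scaling forms, in which idle power at a given RPM is the full-speed power multiplied by the RPM ratio (linear) or by its square (quadratic). These stripped-down forms, the function names, and the example numbers are assumptions made for illustration; they are not the fitted models of section 3, so the size of the gap they show should not be read as a result of this paper.

```python
def idle_power(rpm, rpm_max, p_max, model):
    """Idle power at `rpm` under a pure scaling model (no constant term)."""
    ratio = rpm / rpm_max
    if model == "quadratic":
        return p_max * ratio ** 2
    if model == "linear":
        return p_max * ratio
    raise ValueError(f"unknown model: {model}")

def saving_fraction(rpm, rpm_max, model):
    """Fraction of full-speed idle power saved by idling at `rpm`."""
    return 1.0 - idle_power(rpm, rpm_max, 1.0, model)

# Idling at three quarters of a hypothetical 12000 RPM maximum.
quad = saving_fraction(9000, 12000, "quadratic")
lin = saving_fraction(9000, 12000, "linear")
```

At any RPM below the maximum, the ratio is less than one, so its square is smaller still: the quadratic model always predicts lower idle power, and therefore higher savings, than the linear one.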




Figure 8. Behavior of DRPM_perf for a Power
Model that relates the RPM and P_idle linearly



We have also conducted similar sensitivity analysis for
other factors such as the step-size employed for the spindle
motor and the type of RAID configuration. The interested
reader is referred to [10] for the details. In general, we
find that DRPM_perf can provide significant energy savings
across a wide spectrum of disk and array configurations.



6 A Heuristic DRPM Algorithm 

Having evaluated the potential of DRPM without any 
performance degradation, which requires an idle time pre- 
diction oracle, we next move on to describe a scheme that 
can be used in practice to benefit from this mechanism. The 
goal is to save energy using the multiple RPM levels, with- 
out significantly degrading performance (response time). 

In this scheme, (i) the array controller communicates a 
set of operating RPM values to the individual disks based 
on how performance characteristics (response time) of the 
workload evolve. More specifically, the controller speci- 
fies watermarks for disk RPM extremes between which the 
disks should operate; (ii) subsequently, each disk uses local 
information to decide on RPM transitions. 

Periodically, each disk inspects its request queue to check
the number of requests (N_req) waiting for it. If this number
is less than or equal to a specific value N_min, this can
indicate a lower load and the disk ramps down its speed by
one step. It can so happen that, over a length of time, the
disks gradually move down to a very low RPM, even
with a high load, and do not move back up. Consequently,
it is important to periodically limit how low an RPM the
disks should be allowed to go to. This decision is made by
the array controller at the higher level, which can track
response times to find points when performance degradation
becomes significant enough to ramp up the disks (or to limit
how low they can operate at those instants).
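
The disk-local part of the scheme can be sketched as a single decision function. The step size of 600 RPM, the function name, and the choice to enforce the watermark by clamping are assumptions made for this example; the text above only specifies that a disk steps down when N_req is at most N_min and that the controller bounds how low it may go.

```python
def next_rpm(current_rpm, n_req, low_wm, n_min=0, step=600):
    """Return the RPM a disk should move to at its periodic check.

    current_rpm -- RPM the disk is spinning at now
    n_req       -- number of requests waiting in the disk's queue
    low_wm      -- low watermark set by the array controller
    n_min       -- queue-length threshold for ramping down
    step        -- RPM change per ramp-down step (hypothetical value)
    """
    target = current_rpm
    if n_req <= n_min:
        # Light load: ramp down by one step.
        target = current_rpm - step
    # Never operate below the controller's watermark; if the watermark was
    # raised above the current speed, this also forces a ramp-up.
    return max(low_wm, target)
```

For example, with a watermark of 6000 RPM, an idle disk at 12000 RPM would step to 11400 RPM, while a disk already at 6000 RPM would stay there. Raising the watermark to 12000 RPM brings every disk back to full speed at its next check.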

The array controller tracks average response times for
n-request windows. At the end of each window, it calculates
the percentage change in the response time over the past two
windows. If this percentage change (ΔT_resp) is

• larger than an upper tolerance (UT) level, then the
controller immediately issues a command to all the disks
that are operating at lower RPMs to ramp up to the full
speed. This is done by setting the LOW_WM (low
watermark) at each disk to the full RPM, which says
that the disks are not supposed to operate below this
value.

• between the upper (UT) and lower (LT) tolerance
levels, then the controller keeps the LOW_WM where
it is, since the response time is within the tolerance
levels.

• less than the lower tolerance level (LT), then
the LOW_WM can be lowered even further. The specific
RPM that is used for the LOW_WM is calculated
proportionally based on how much the response
time change is lower than LT.

These three scenarios are depicted in Figure 9, which shows
the choice of the LOW_WM for example differences in
response time changes with UT = 15%, LT = 5%, and eight
possible values for the LOW_WM. These are also the values
used in the results to be presented, and window sizes are
n = 250, 500, 1000, though we have experimented with a
more comprehensive design space. In our experiments, we
set N_min = 0, whereby the disks initiate a rampdown of
their RPM based on whether their request-queue is empty
or not.
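
The controller side of the heuristic can be sketched as below, using UT = 15%, LT = 5%, and eight watermark levels as in the text. The concrete RPM values, the function names, and in particular the linear mapping from "how far below LT" to "how many levels to drop" are assumptions for illustration; the text only states that the new LOW_WM is calculated proportionally, and the exact mapping used in the experiments is the one illustrated in Figure 9.

```python
import math

# Eight hypothetical watermark levels, lowest to highest RPM.
LEVELS = list(range(3600, 12001, 1200))

def percent_change(t_prev, t_curr):
    """Percentage change in average response time between two windows."""
    return 100.0 * (t_curr - t_prev) / t_prev

def update_low_wm(low_wm_idx, diff, levels=LEVELS, ut=15.0, lt=5.0):
    """Return the new LOW_WM index into `levels` after one n-request window."""
    top = len(levels) - 1
    if diff > ut:
        # Degradation too high: force every disk back to full speed.
        return top
    if diff >= lt:
        # Within the tolerance band: leave the watermark where it is.
        return low_wm_idx
    # Below LT: lower the watermark in proportion to the shortfall.
    # Assumed linear mapping, capped at the full range of levels.
    shortfall = min(1.0, (lt - diff) / lt)
    drop = max(1, math.ceil(shortfall * top))
    return max(0, low_wm_idx - drop)

# Response time rose from 10 ms to 12 ms between windows: a 20% change.
idx_after_spike = update_low_wm(3, percent_change(10.0, 12.0))
```

Because the window is defined in requests rather than in time, this update runs often during bursts and rarely during long idle stretches.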




Figure 10. DRPM Heuristic Scheme Results.
UT = 15%, LT = 5%, N_min = 0. The results
are presented for n = 250, 500, 1000, referred
to as DRPM-250, DRPM-500, and DRPM-1000
respectively.







Figure 9. The operation of the DRPM heuristic for UT = 15% and LT = 5%. In each figure, for the
choice of low watermarks, the dotted line shows where LOW_WM is before the heuristic is applied and
the solid line shows the result of applying the scheme. The percentage difference in the response
times t_1 and t_2 between successive n-request windows, diff, is calculated. (a) If diff > UT, then
LOW_WM is set to the maximum RPM for the next n requests. (b) If diff lies between the two tolerance
limits, the current value of LOW_WM is retained. (c) If diff < LT, then the value of LOW_WM is set
to a value less than the maximum RPM. Since diff is higher than 50% of LT but less than 75% of
LT in this example, it is set two levels lower than the previous LOW_WM. If it were between 75% and
87.5%, it would have been set three levels lower, and so on.



6.1 Results with DRPM 



We have conducted extensive experiments to evaluate 
how well the above heuristic (denoted as simply DRPM in 
the rest of this paper) fares, not only in terms of its abso- 
lute energy savings and response time degradation, but also 
comparing it to DRPM_perf and static RPM choices
(where non-DRPM disks of lower RPMs are used). The 
complete set of experimental results is given in [10] and we 
present the highlights here. 

The first set of results in Figure 10 shows the energy savings
and response time degradation of our DRPM heuristic
with respect to not performing any power optimization
(referred to as Baseline). The energy savings are given with
both the quadratic and linear power models discussed earlier
for two different inter-arrival times in each of the two
distributions. Note that these are E_tot savings, and not just
those for the idle energy.

We observe that the heuristic gives savings as good as, if not
better than in some cases (especially at higher loads), those of
DRPM_perf, which has already been shown to give good
energy savings. Remember that DRPM_perf services requests
at the highest RPM even if it transitions to lower
RPMs during idle periods. This results in higher active energy
compared to the above heuristic, which allows lower
RPMs for serving requests, and can also incur higher transition
costs in always getting back to the highest RPM. These
effects are more significant at higher loads (smaller idle
periods), causing our heuristic to in fact give better energy
savings than DRPM_perf. At lighter loads, the long idle
periods amortize such costs, and the knowledge of how long
they are helps DRPM_perf transition directly to the appropriate
RPM instead of lingering at higher RPMs for longer
times as is done in the heuristic scheme. Still, the energy
savings for the heuristic are quite good and are not far
from those of DRPM_perf, which has perfect knowledge of idle
times. The results for the heuristic have been shown with
different choices for n, the window of requests after which
the LOW_WM is recalculated. A large window performs
modulations at a coarser granularity, thus allowing the disks
to linger at lower RPMs longer even when there may be
some performance degradation. This can result in greater
energy savings for larger n values, as is observed in many
cases.

The response time characteristics of the heuristic are
shown as CDF plots in Figure 10, rather than as an average,
to more accurately capture the behavior through the
execution. It can happen that a few requests get inordinately
delayed while most of the requests incur very little
delay. A CDF plot, which shows the fraction of requests
that have response times lower than a given value
on the x-axis, can capture such behavior while a simple average
across requests cannot. These plots show the Baseline
behavior, which is the original execution without any
power savings being employed, and is also the behavior of
DRPM_perf, which does not alter the timing behavior of requests.
The closeness of the CDF plots of the heuristic to
the Baseline curve is an indication of how good a job it does
of limiting degradation in response time.
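
The argument for CDFs over averages can be made concrete with a small sketch. The response times below are made-up numbers chosen to contain one badly delayed request, and the helper names are hypothetical; the code simply contrasts the mean with the fraction of requests completing within a limit.

```python
def empirical_cdf(samples):
    """Return (xs, fs): the sorted samples and the rank-based CDF
    estimate fs[i] = (i + 1) / n for each sorted value."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def fraction_within(samples, limit):
    """Fraction of samples that are less than or equal to `limit`."""
    return sum(1 for s in samples if s <= limit) / len(samples)

# Made-up response times in milliseconds: nine fast requests, one outlier.
response_ms = [5, 6, 5, 7, 200, 6, 5, 6, 7, 6]

mean_ms = sum(response_ms) / len(response_ms)
within_10ms = fraction_within(response_ms, 10)
xs, fs = empirical_cdf(response_ms)
```

Here the mean is pulled far above what a typical request experiences by a single outlier, whereas the CDF shows directly that nine out of ten requests finish within 10 ms.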

At higher loads, it is more important to modulate the
RPM levels (LOW_WM) at a finer granularity to ensure
that the disks do not keep going down in RPMs arbitrarily.
We see that a finer resolution (n = 250 requests)
does tend to keep the response time CDF of the heuristic
close to the Baseline. In <Par,10> and <Par,50>, one can
hardly discern differences between the Baseline and the
corresponding heuristic results. Remember that the Pareto
traffic has bursts of I/O requests followed by longer idle
periods. Since our heuristic modulates the LOW_WM based
on the number of requests (rather than time), this modulation
is done fast enough during the bursts that the response
times of those requests are not significantly compromised,
and slowly enough during the longer idle periods
that the energy savings are obtained during those times.
In the exponential traffic, while there are some deviations
from the baseline, we are still able to keep over 90% of
requests within a 5% response time degradation margin with
an n = 250 window, while giving over 35% energy savings
(in the quadratic model). Changing the power model from
quadratic to linear does not change the trends, as was pointed
out earlier, and we still find over 25% energy savings.



6.2 Controlling UT and LT for Power- 
Performance Trade-offs 

The DRPM heuristic provides two additional parameters
(in addition to n already considered) - UT and LT - for
modulating the RPM control. By keeping UT where it is, and
moving LT up (closer to UT), we can allow the disks to
transition to even lower RPM levels, thereby saving even
more energy without compromising significantly on
performance. This is shown by comparing the results for UT=15%
and LT=10% in Figure 11 (a) with those of the results in
Figure 10 (at least for higher loads).

Similarly, one can bring the UT parameter closer to LT,
to reduce response time degradation without significantly
changing the energy results. This is shown by comparing
the results for UT=8% and LT=5% in Figure 11 (b) with
those of the results in Figure 10.

This heuristic thus provides an elegant approach for 
determining where one wants to operate in the power- 
performance profile. 









Figure 11. Controlling UT and LT for Power-Performance
Tradeoffs. (a) presents the results
for UT=15%, LT=10%. (b) presents the
results for UT=8%, LT=5%.



7 Issues in Implementing DRPM Disks 

Having demonstrated the power and performance poten- 
tial of DRPM, it is important to understand some of the ram- 
ifications in its physical realization: 

• Providing Speed Control

As mentioned in section 3, speed control in DC brushless
PM motors can be achieved using PWM techniques.
PWM achieves speed control by switching the power supply
to the motor on and off at a certain frequency; the fraction
of each switching period during which the supply is on is
called the duty cycle. The choice of duty cycle
determines the motor speed. The design of such
speed-control mechanisms can be found in [3].

• Head Fly-Height

The height at which the disk head slider flies from
the platter surface depends on the linear velocity of
the spinning platter, v, which can be expressed as
v = 2πr f_spin, where r is the radius of the disk and
f_spin is the frequency of rotation (measured in RPM).
The fly height needs to be more or less constant over
the entire range of linear velocities (RPMs) supported
by the given spindle system. The Papillon slider presented
in [25] is capable of maintaining this constant
fly height over the range of RPMs that we have considered.

• Head Positioning Servo and Data Channel Design 

In hard-disks, positioning the head requires accurate
information about the location of the tracks. This
information is encoded as servo-signals on special
servo-sectors that are not accessible by normal read/write
operations to the disk. This servo information is given
to the actuator to accurately position the head over the
center of the tracks. The servo information needs to
be sampled at a certain frequency to position the head
properly. As the storage density increases, the number
of Tracks Per Inch (TPI) increases, requiring higher
sampling frequencies. This sampling frequency is directly
proportional to the spinning speed of the disk,
f_spin. Therefore, at lower f_spin it might not be possible
to properly sample the servo information. [40] addresses
this problem by designing a servo system that
can operate at both low and high disk RPMs along with
a data channel that can operate over the entire range of
data-rates over the different RPMs (the data rate of a
channel is directly proportional to f_spin).

• Idle-Time Activities 

Server environments exploit idle periods in disks to
perform other operations such as validating the disk
contents and optimizing for any errors ([19, 22]). Such
operations occur too infrequently, relative to the
idle periods themselves, to have a significant effect
on the effectiveness of power saving techniques.
Still, it is possible that DRPM may be more
useful for such activities, since it allows those
performance non-critical operations to be undertaken at
a relatively slow RPM (for energy savings), while traditional
power mode control, which transitions the disk
completely to a standby state, prevents such activities.

• Smart Disk Capabilities 

The anticipated smart disks [23, 33] provide an excellent
platform for implementing DRPM algorithms,
and also provide the flexibility of modulating the algo- 
rithm parameters or even changing the algorithm en- 
tirely during the course of execution. 
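
The proportionality relations used in the items above (linear velocity, servo sampling frequency, and channel data rate all scaling with f_spin) can be checked with a short calculation. The platter radius, the full-speed data rate, and the function names are hypothetical values chosen for this sketch; the RPM is divided by 60 so that the velocity comes out in metres per second.

```python
import math

def linear_velocity_m_per_s(radius_m, rpm):
    """v = 2 * pi * r * f_spin, with f_spin converted from RPM to rev/s."""
    return 2.0 * math.pi * radius_m * (rpm / 60.0)

def scaled_rate(rate_at_full, rpm, rpm_full):
    """Scale a quantity that is directly proportional to f_spin,
    such as the servo sampling frequency or the channel data rate."""
    return rate_at_full * rpm / rpm_full

R = 0.04  # assumed radius in metres, for illustration only

v_full = linear_velocity_m_per_s(R, 12000)
v_half = linear_velocity_m_per_s(R, 6000)
rate_half = scaled_rate(50.0, 6000, 12000)  # assumed 50 MB/s at full speed
```

Halving the RPM halves the velocity under the slider as well as the servo sampling and data rates, which is why the slider, servo system, and data channel all have to tolerate the full RPM range.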



The effect of RPM modulation on disk reliability needs
further investigation. On the one hand, we have been
increasing the number of disks in arrays not only to enhance
performance, but also for availability. This in turn has
accentuated the power problem, which this paper has tried
to address. In doing so, it is conceivable that we may
need more disks as hot spares in case RPM modulation
worsens MTTFs. This vicious cycle between performance,
power and availability warrants further investigation,
which we plan to undertake in the future.



8 Concluding Remarks 

This paper has presented a new approach to address the
growing power problem in large disk arrays. Instead of
completely spinning down disks, which can incur significant
time and power costs, this paper proposes to modulate
the RPM of disks dynamically. The resulting DRPM mechanism
has been shown to find more scope for power savings
when idle times are not very long compared to traditional
power management (TPM) techniques that have been proposed
for laptop/desktop disks. In addition, it also allows
the option of servicing requests at a lower RPM when
performance is not very critical, to provide additional power
savings. Finally, it can be combined with TPM techniques
to amplify the power savings.

We have proposed timing and power models for the 
DRPM mechanism, and have conducted a sensitivity anal- 
ysis of different hardware parameters. In addition, we have 
presented a heuristic that can be used in practice to bene- 
fit from the DRPM mechanism to allow trade-offs between 
power savings and performance benefits. Detailed simulations
have shown that we can get considerable energy savings
without significantly compromising on performance.

It is to be noted that the heuristic presented here is one 
simple way of using the DRPM mechanism though it is 
conceivable that one can optimize/change this further to get 
higher power savings, or to limit the performance degrada- 
tion. 